Credit Card Users Churn Prediction

Background

Customers leaving the credit card service cause losses for the bank. The bank therefore wants to analyze customer data, identify the customers who are likely to leave, and understand their reasons for leaving, so that it can improve in those areas.

Objective:

Thera Bank needs a classification model that will help it improve its services so that customers do not renounce their credit cards.

  1. Explore and visualize the dataset.
  2. Build a classification model to predict if the customer is going to churn or not
  3. Optimize the model using appropriate techniques
  4. Generate a set of insights and recommendations that will help the bank

Data Context

Customer details:
  1. CLIENTNUM: Client number. Unique identifier for the customer holding the account
  2. Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  3. Customer_Age: Age in Years
  4. Gender: Gender of the account holder
  5. Dependent_count: Number of dependents
  6. Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
  7. Marital_Status: Marital Status of the account holder
  8. Income_Category: Annual Income Category of the account holder
  9. Card_Category: Type of Card
  10. Months_on_book: Period of relationship with the bank
  11. Total_Relationship_Count: Total no. of products held by the customer
  12. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  13. Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  14. Credit_Limit: Credit Limit on the Credit Card
  15. Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  16. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  17. Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  18. Total_Trans_Ct: Total Transaction Count (Last 12 months)
  19. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
  20. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
  21. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Importing all the required packages for the project:

Read in the dataset using pandas

View the first 10 rows of the dataset

It is usually better to inspect random rows instead of only the top rows.
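As a minimal sketch of the difference between the two views, the snippet below builds a small stand-in DataFrame (the values are made up, only the column names come from the data dictionary) and compares `head()` with `sample()`.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; the notebook would use the
# DataFrame loaded from the bank's file instead.
df = pd.DataFrame({
    "CLIENTNUM": range(1, 21),
    "Customer_Age": [30 + i for i in range(20)],
})

# head() only shows the first rows, which may not be representative
# if the file happens to be sorted.
top_rows = df.head(10)

# sample() draws rows at random; random_state makes the draw repeatable.
random_rows = df.sample(n=10, random_state=1)

print(random_rows)
```

Fixing `random_state` keeps the sampled rows stable between runs, which makes the observations written below the output reproducible.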

The dataset looks consistent with the description provided in the Data Dictionary. Several columns contain values that need to be processed before the data can be used effectively for analysis.

Check the shape of the provided data

Check datatype count

Get the complete info of the dataset

Checking for missing values in the data

Check for duplicate data

Number of unique values in each column
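The three checks above (missing values, duplicates, unique values) can be sketched with standard pandas calls. The frame below is a made-up stand-in, not the bank's data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data: one missing education level, one missing marital
# status, and the last row is an exact duplicate of the fourth.
df = pd.DataFrame({
    "CLIENTNUM": [1, 2, 3, 4, 4],
    "Education_Level": ["Graduate", np.nan, "Doctorate", "Graduate", "Graduate"],
    "Marital_Status": ["Married", "Single", np.nan, "Married", "Married"],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that repeat an earlier row exactly.
duplicate_rows = df.duplicated().sum()

# Number of distinct non-missing values in each column.
unique_per_column = df.nunique()

print(missing_per_column, duplicate_rows, unique_per_column, sep="\n\n")
```

A column such as CLIENTNUM, where the number of unique values is close to the number of rows, is an identifier and is usually dropped before modeling.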

Describe all columns

  1. There are 10127 rows and 21 columns (int64: 10 columns, float64: 5 columns, object: 6 columns).
  2. There are missing values which may need to be treated.

EDA

Data Analysis: Univariate Analysis

Almost 16% of customers have attrited.
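A percentage like this can be read directly from `value_counts(normalize=True)`. The sketch below uses toy labels (8 existing and 2 attrited customers), not the bank's data, so the resulting share is illustrative only.

```python
import pandas as pd

# Toy target column: 8 existing and 2 attrited customers (made-up counts).
flags = pd.Series(
    ["Existing Customer"] * 8 + ["Attrited Customer"] * 2,
    name="Attrition_Flag",
)

# normalize=True returns fractions instead of counts; scale to percent.
share = flags.value_counts(normalize=True).mul(100).round(1)

print(share)
```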

Age has an approximately symmetric distribution, with the mean and median close together. There are not many outliers in the distribution of this variable.

52.9% of customers are female and 47.1% are male.

More than 91% of customers have dependents.

14.7% of customers are uneducated. There are many missing values that need to be investigated.

Marital status has many missing values that need to be investigated.

Most customers have an income of less than $40K. The ABC value needs to be investigated.

93% of customers have a Blue card.

Almost 25% of customers have been with the bank for 36 months. We need to understand what is unique about the 36-month value and why it is so much more frequent than all other values.

Almost 80% of customers have more than 2 products with the bank.

The maximum inactivity in the data is 6 months.

Approximately 80% of customers had 3 or fewer contacts with the bank.

The credit limit has high variation and is right-skewed.

Data Analysis: Bivariate Analysis

The pair plot shows the pairwise relationships between the columns of the dataset, giving an overall view of the selected variables.

Customer age does not appear to affect attrition.

More female customers have attrited, but this appears proportional to the larger number of female customers.

The number of dependents is not strongly correlated with attrition.

The number of dependents and education level are not strongly correlated. Attrition % is higher at the Post-Graduate and Doctorate levels.

Attrition % is a little higher for divorced customers than for others.

Attrition % is higher for income groups above $40K. The significance of the ABC value needs to be assessed and treated accordingly.
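Attrition percentages per category, as used in the observations above, can be computed with a row-normalized cross-tabulation. This is a minimal sketch on made-up rows; the same call would apply to any categorical column such as Income_Category or Marital_Status.

```python
import pandas as pd

# Made-up rows: 1 of 4 female and 2 of 4 male customers have attrited.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Attrition_Flag": [
        "Attrited Customer", "Existing Customer",
        "Existing Customer", "Existing Customer",
        "Attrited Customer", "Attrited Customer",
        "Existing Customer", "Existing Customer",
    ],
})

# normalize="index" makes each row sum to 1, so every cell is the share
# of that category's customers with the given flag.
attrition_pct = pd.crosstab(
    df["Gender"], df["Attrition_Flag"], normalize="index"
).mul(100)

print(attrition_pct)
```

Comparing row-normalized shares rather than raw counts avoids mistaking a larger group for a group with higher attrition.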

Insights based on EDA

  1. Almost 16% of customers have attrited.
  2. Age has an approximately symmetric distribution, with the mean and median close together.
  3. There are not many outliers in the age distribution.
  4. 52.9% of customers are female and 47.1% are male.
  5. More than 91% of customers have dependents.
  6. 14.7% of customers are uneducated. There are many missing values that need to be investigated.
  7. There are many missing values in marital status that need to be investigated.
  8. Most customers have an income of less than $40K. The ABC value needs to be investigated.
  9. 93% of customers have a Blue card, which makes it the most popular card.
  10. Almost 25% of customers have been with the bank for 36 months. We need to understand what is unique about the 36-month value and why it is so much more frequent than all other values.
  11. Almost 80% of customers have more than 2 products with the bank.
  12. The maximum inactivity in the data is 6 months.
  13. Approximately 80% of customers had 3 or fewer contacts with the bank.
  14. The credit limit has high variation and is right-skewed.
  15. Customer age does not appear to affect attrition.
  16. More female customers have attrited, but this appears proportional to the larger number of female customers.
  17. The number of dependents is not strongly correlated with attrition.
  18. The number of dependents and education level are not strongly correlated. Attrition % is higher at the Post-Graduate and Doctorate levels.
  19. Attrition % is a little higher for divorced customers than for others.
  20. Attrition % is higher for income groups above $40K. The significance of the ABC value needs to be assessed and treated accordingly.

Data Pre-Processing

Missing Value Treatment
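The notebook text does not state which imputation strategy was used. One common option for categorical columns is to fill gaps with the most frequent value learned from the training split only. The sketch below assumes that approach and also assumes that the ABC income value is a placeholder that should be treated as missing; both are assumptions, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up training and test rows with gaps in the categorical columns.
train = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "Graduate", "Doctorate"],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
    "Income_Category": ["Less than $40K", "ABC", "Less than $40K", "$60K - $80K"],
})
test = pd.DataFrame({
    "Education_Level": [np.nan, "Doctorate"],
    "Marital_Status": ["Single", np.nan],
    "Income_Category": ["ABC", "$60K - $80K"],
})

def placeholder_to_nan(frame):
    """Treat the ABC income value as missing (an assumption, see above)."""
    out = frame.copy()
    out["Income_Category"] = out["Income_Category"].mask(
        out["Income_Category"] == "ABC"
    )
    return out

train, test = placeholder_to_nan(train), placeholder_to_nan(test)

# Learn the most frequent value per column from the training split only,
# then apply the same values to both splits to avoid leaking test data.
modes = train.mode().iloc[0]
train_filled = train.fillna(modes)
test_filled = test.fillna(modes)

print(test_filled)
```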

Building the model (using KFold and cross_val_score)
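A minimal sketch of comparing several classifiers with cross-validation follows. The data is synthetic (`make_classification`), the model list and the recall metric are assumptions rather than the notebook's actual choices, and `StratifiedKFold` is used in place of plain `KFold` so that each fold keeps the same class ratio.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared training data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.84, 0.16], random_state=1,
)

# Hypothetical candidate list; the notebook's own list is not shown.
models = {
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
}

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Mean recall on the positive (minority) class across the 5 folds.
cv_recall = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="recall", cv=cv)
    cv_recall[name] = scores.mean()
    print(f"{name}: {cv_recall[name]:.3f}")
```

Recall is a reasonable metric when missing a churning customer is costlier than a false alarm, but the right metric depends on the bank's costs.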

The performance of all the models is similar.

Model building Oversampled data

Model building Undersampled data
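The text does not say which resampling method was used for the oversampled and undersampled runs. As a sketch of the idea only, the snippet below uses random resampling from `sklearn.utils.resample` on synthetic training data; a synthetic-sample method such as SMOTE would be a different technique with the same goal.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic training split: 84 majority (0) and 16 minority (1) rows.
rng = np.random.RandomState(1)
X_train = rng.normal(size=(100, 3))
y_train = np.array([0] * 84 + [1] * 16)

X_major = X_train[y_train == 0]
X_minor = X_train[y_train == 1]

# Oversampling: draw minority rows with replacement until the classes match.
X_minor_up = resample(
    X_minor, replace=True, n_samples=len(X_major), random_state=1
)
X_over = np.vstack([X_major, X_minor_up])
y_over = np.array([0] * len(X_major) + [1] * len(X_minor_up))

# Undersampling: keep a random subset of majority rows, without replacement.
X_major_down = resample(
    X_major, replace=False, n_samples=len(X_minor), random_state=1
)
X_under = np.vstack([X_major_down, X_minor])
y_under = np.array([0] * len(X_major_down) + [1] * len(X_minor))

print(X_over.shape, X_under.shape)
```

Resampling should be applied to the training split only, so that validation and test scores reflect the original class balance.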

Hyperparameter Tuning

Hyperparameter tuning using AdaBoost Classifier

AdaBoost

GridSearchCV

Model performance is good

RandomizedSearchCV

Performance is similar to that of GridSearchCV.
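The two search strategies can be sketched side by side as follows. The data is synthetic, and the parameter grid and recall metric are illustrative assumptions, not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.84, 0.16], random_state=1,
)

# Hypothetical search space: 2 x 3 = 6 combinations.
param_grid = {
    "n_estimators": [25, 50],
    "learning_rate": [0.1, 0.5, 1.0],
}

# Grid search evaluates every combination in the grid.
grid = GridSearchCV(
    AdaBoostClassifier(random_state=1), param_grid, scoring="recall", cv=3
)
grid.fit(X, y)

# Randomized search evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1), param_distributions=param_grid,
    n_iter=3, scoring="recall", cv=3, random_state=1,
)
rand.fit(X, y)

print("grid:", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Randomized search fits fewer candidates, so it is cheaper on large search spaces, at the risk of missing the best combination.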

XGBoost

GridSearchCV

RandomizedSearchCV

Comparing all models

XGBoost tuned with random search has the best performance.

Performance on the test set

The model is not overfitting and works well on the test data.

Total_Ct_Chng_Q4_Q1, Total_Revolving_Bal, Total_Trans_Ct, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon and Total_Trans_Amt are the most important features according to the selected model.
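A ranking like this is typically read from the fitted model's `feature_importances_` attribute. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost model, with synthetic data and placeholder feature names, so the printed ranking says nothing about the bank's data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    random_state=1,
)
feature_names = [f"feature_{i}" for i in range(5)]

# Stand-in for the selected model; tree ensembles expose
# feature_importances_ after fitting.
model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Pair each importance with its feature name and sort, highest first.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```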

Create Model Using Pipeline
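A pipeline bundles preprocessing and the classifier so the same steps run at fit and predict time. The sketch below is one possible layout, not the notebook's actual pipeline: the data and the target rule are synthetic, the imputation strategies are assumptions, and `GradientBoostingClassifier` stands in for the tuned XGBoost model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data with a few gaps in one numeric and one
# categorical column.
rng = np.random.RandomState(1)
n = 200
balance = rng.randint(0, 2500, size=n).astype(float)
balance[::30] = np.nan
gender = rng.choice(["F", "M"], size=n).astype(object)
gender[::25] = np.nan
df = pd.DataFrame({
    "Total_Trans_Ct": rng.randint(10, 120, size=n),
    "Total_Revolving_Bal": balance,
    "Gender": gender,
})
df["Gender"] = df["Gender"].astype(object)

# Synthetic target rule, for illustration only.
y = (df["Total_Trans_Ct"] < 45).astype(int)

numeric_cols = ["Total_Trans_Ct", "Total_Revolving_Bal"]
categorical_cols = ["Gender"]

# Numeric columns: median imputation. Categorical columns: most frequent
# value, then one-hot encoding.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", GradientBoostingClassifier(random_state=1)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=1
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the imputers and encoder are fitted inside the pipeline, they learn only from the training split, which avoids leaking test data into preprocessing.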

Business Recommendations

  1. Total_Ct_Chng_Q4_Q1, Total_Revolving_Bal, Total_Trans_Ct, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon and Total_Trans_Amt are the most important features according to the selected model.
  2. XGBoost tuned with random search gives the best performance.
  3. The bank should provide incentives for cardholders to keep spending, as long periods of inactivity can increase the likelihood of attrition.
  4. There are more female customers than male customers. The marketing team needs to find out why there are fewer male customers.
  5. The most popular card is the Blue card. The bank needs to find out why the other cards are not as popular.
  6. The relationship count can affect attrition, as dependent card holders can have a good influence on the attrition rate.